Morphological Reconstruction for Word Level Script Identification
Authors
Abstract
A line of a bilingual document page may contain text words in a regional language and numerals in English. For Optical Character Recognition (OCR) of such a page, the different scripts must be identified before a script-specific OCR system is run. In this paper, we use morphological opening by reconstruction of an image in different directions, together with regional descriptors, as a tool for script identification at the word level, based on the observation that every script has a distinct visual appearance. The proposed system is developed for three major Indian bilingual document types (Kannada, Telugu and Devanagari), each containing English numerals. The nearest neighbour and k-nearest neighbour algorithms are applied to classify new word images. The proposed algorithm is tested on 2625 words with various font styles and sizes. The results obtained are quite encouraging.
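The abstract does not give implementation details, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a binary word image stored as a list of lists of 0/1, line-shaped structuring elements in four directions, and a simple hypothetical descriptor (the fraction of foreground pixels that survive each directional opening by reconstruction). The names `DIRECTIONS`, `erode_line`, `reconstruct`, `directional_features` and `knn_predict` are illustrative assumptions.

```python
from collections import Counter, deque

# Assumed line directions as (row step, column step); the paper's actual
# structuring elements are not stated in the abstract.
DIRECTIONS = {
    "horizontal": (0, 1),
    "vertical": (1, 0),
    "diagonal_down": (1, 1),
    "diagonal_up": (-1, 1),
}


def erode_line(img, step, length):
    """Binary erosion by a line of `length` pixels that starts at each pixel."""
    rows, cols = len(img), len(img[0])
    dr, dc = step
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            fits = True
            for k in range(length):
                rr, cc = r + k * dr, c + k * dc
                if not (0 <= rr < rows and 0 <= cc < cols and img[rr][cc]):
                    fits = False
                    break
            out[r][c] = 1 if fits else 0
    return out


def reconstruct(marker, mask):
    """Binary reconstruction by dilation: keep every 8-connected component
    of `mask` that contains at least one `marker` pixel."""
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    queue = deque()
    for r in range(rows):
        for c in range(cols):
            if marker[r][c] and mask[r][c]:
                out[r][c] = 1
                queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if (0 <= rr < rows and 0 <= cc < cols
                        and mask[rr][cc] and not out[rr][cc]):
                    out[rr][cc] = 1
                    queue.append((rr, cc))
    return out


def open_by_reconstruction(img, step, length):
    """Opening by reconstruction: erode, then reconstruct under the original."""
    return reconstruct(erode_line(img, step, length), img)


def directional_features(img, length):
    """Hypothetical descriptor: share of foreground kept per direction."""
    total = sum(map(sum, img)) or 1  # avoid division by zero on blank images
    return [
        sum(map(sum, open_by_reconstruction(img, step, length))) / total
        for step in DIRECTIONS.values()
    ]


def knn_predict(train, query, k=1):
    """k-nearest-neighbour vote with squared Euclidean distance.
    `train` is a list of (feature_vector, label) pairs; k=1 gives the
    plain nearest-neighbour rule."""
    ranked = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

In this sketch, a component survives an opening only if a line of the chosen length and direction fits somewhere inside it, and reconstruction then restores that component in full rather than leaving a truncated stroke. Feature vectors from labelled word images would form `train`, and `knn_predict(train, directional_features(word, length), k)` would assign a script label to a new word image.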
Similar resources
Word level script identification for scanned document images
In this paper, we compare the performance of three classifiers used to identify the script of words in scanned document images. In both training and testing, a Gabor filter is applied and 16 channels of features are extracted. Three classifiers (Support Vector Machines (SVM), Gaussian Mixture Model (GMM) and k-Nearest-Neighbor (k-NN)) are used to identify different scripts at the word level (...
Word level Script Identification from Bangla and Devanagri Handwritten Texts mixed with Roman Script
India is a multi-lingual country where Roman script is often used alongside different Indic scripts in a text document. To develop a script specific handwritten Optical Character Recognition (OCR) system, it is therefore necessary to identify the scripts of handwritten text correctly. In this paper, we present a system, which automatically separates the scripts of handwritten words from a docum...
A Multiple Feature based Novel Approach for Identification of Printed Indian Scripts at Word Level
In a country like India, where different scripts are in use, automatic identification of printed script facilitates many important applications, such as automatic transcription of multilingual documents and selection of a script-specific OCR in a multilingual environment. In this paper a novel method to identify the script type of the collection of documents printed in seven Indian language...
Recurrent neural networks based Indic word-wise script identification using character-wise training
This paper presents a novel methodology of Indic handwritten script recognition using Recurrent Neural Networks and addresses the problem of script recognition in poor data scenarios, such as when only character level online data is available. It is based on the hypothesis that curves of online character data comprise sufficient information for prediction at the word level. Online character dat...
Convolution Based Technique for Indic Script Identification from Handwritten Document Images
Determination of the script type of a document image is a complex real-life problem for a multi-script country like India, where 23 official languages (including English) are present and 13 different scripts are used to write them. Including English and Roman, those counts become 23 and 13 respectively. The problem becomes more challenging when handwritten documents are considered. In this paper an app...
Journal title:
- CoRR
Volume abs/1106.5156 Issue
Pages -
Publication date 2007